This R notebook accompanies the QUANTBIO2019 workshop: BRIDGING THE GAP BETWEEN COMPUTATIONAL AND EXPERIMENTAL APPROACHES (EPFL – ETHZ Summer School 2019, Fiesch, Switzerland).

We will use the following R packages:

# load packages
# input/output
library(readxl, quietly = T)       # to read Excel files
library(R.utils, quietly = T)      # to read compressed files directly without unzipping

# data manipulation
library(data.table, quietly = T)   # for fast processing of large datasets
library(imputeTS, quietly = T)     # for data imputation
library(dtwclust, quietly = T)     # for dynamic time warping

# cluster analysis
library(NbClust)
library(cluster)
library(factoextra)

# visualisation
library(ggplot2, quietly = T)      # to make amazing plots
library(plotly, quietly = T)       # to make the amazing plots interactive
library(gplots, quietly = T)       # for plotting heatmaps from clustering
library(d3heatmap, quietly = T)    # for interactive heatmaps
library(dendextend, quietly = T)   # for manipulating dendrograms
library(RColorBrewer, quietly = T) # for nice color palettes
library(scales, quietly = T)       # for percentages on y-axis in ggplot
library(ggthemes, quietly = T)     # for additional ggplot themes and color palettes

Set global variables with parameters of the analysis:

# set some global variables

lPar = list(
  dirRoot = ".",     # path to root directory of this analysis
  dirData = "data",  # path to data directory, relative to root
  freqAcq = 2        # acquisition frequency in minutes
)
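These parameters are used throughout the script via `file.path` and simple arithmetic. A quick sketch with the toy values above (the `frameToMin` helper is hypothetical, added here only for illustration):

```r
lPar = list(
  dirRoot = ".",     # path to root directory of this analysis
  dirData = "data",  # path to data directory, relative to root
  freqAcq = 2        # acquisition frequency in minutes
)

# build a path relative to the analysis root
file.path(lPar$dirRoot, lPar$dirData, "original")
# "./data/original"

# convert a frame index to minutes using the acquisition frequency
frameToMin = function(frame) frame * lPar$freqAcq
frameToMin(10)
# 20
```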

R primer

The R for Data Science book and the accompanying website are an excellent resource to get acquainted with R and its applications to data science.

Here is an introduction to the data.table package that we use extensively in this analysis. Some advanced tips and tricks with data.table are covered here.

Data description

The data come from a microfluidic experiment where PC-12 cells were stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations. They come from our recent publication Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019) available here.

The microfluidic device has four separate channels, thus 4 different treatments can be performed in a single experiment.

The PC-12 cells express a FRET biosensor that reports the activity of the ERK kinase, an end point of the MAPK signalling pathway. The three growth factors tested activate different parts of the MAPK network and result in different signalling dynamics, i.e. the behaviour of ERK activity over time.

Cells stimulated with growth factors were then imaged at single-cell resolution over 3-4 hours at a 2-minute time resolution. Each channel of the microfluidic device was imaged at 4 different fields of view. Thus, we have 16 fields of view per experiment.

The folder data contains 3 sets of files that correspond to 3 growth factor treatments at 4 different concentrations. A single set from a single experiment includes:

tCoursesSelected_xGFsust_expXXXX.csv.gz a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell, at a single time point. Columns:

experimentDescription_EGFsust.xlsx an Excel file with experiment description. Columns:

Load data

The aim is to load 3 files that correspond to 3 GF treatments and to combine them into a single data.table object that includes info about the experiment.

Brute-force way

Since we only have 3 files we can just load data into 3 separate variables and concatenate them later.

# Load individual files into 3 separate variables
dtEGF = fread(file.path(lPar$dirRoot, lPar$dirData, "original/tCoursesSelected_EGFsust_exp20170316.csv.gz"))
dtFGF = fread(file.path(lPar$dirRoot, lPar$dirData, "original/tCoursesSelected_FGFsust_exp20170224.csv.gz"))
dtNGF = fread(file.path(lPar$dirRoot, lPar$dirData, "original/tCoursesSelected_NGFsust_exp20170725.csv.gz"))

A quick peek into one of the tables reveals the content. There is no information about the experiment or the treatment.

head(dtEGF)
##    Metadata_Series Metadata_T TrackObjects_Label
## 1:               0          0                  3
## 2:               0          0                  4
## 3:               0          0                  8
## 4:               0          0                 12
## 5:               0          0                 13
## 6:               0          0                 15
##    Intensity_MeanIntensity_Ratio
## 1:                      1229.246
## 2:                      1279.513
## 3:                      1236.920
## 4:                      1170.460
## 5:                      1150.294
## 6:                      1128.049

The experiment descriptions are stored in separate Excel files that we have to read as well.

dtEGFexp = as.data.table(read_xlsx(file.path(lPar$dirRoot, lPar$dirData, "original/experimentDescription_EGFsust.xlsx")))
dtFGFexp = as.data.table(read_xlsx(file.path(lPar$dirRoot, lPar$dirData, "original/experimentDescription_FGFsust.xlsx")))
dtNGFexp = as.data.table(read_xlsx(file.path(lPar$dirRoot, lPar$dirData, "original/experimentDescription_NGFsust.xlsx")))

head(dtEGFexp)
##    Position Channel Stim_Duration Stim_Conc Stim_Treat
## 1:        0       1          sust 250 ng/ml        EGF
## 2:        1       1          sust 250 ng/ml        EGF
## 3:        2       1          sust 250 ng/ml        EGF
## 4:        3       1          sust 250 ng/ml        EGF
## 5:        4       2          sust  25 ng/ml        EGF
## 6:        5       2          sust  25 ng/ml        EGF
##    Acquisition_frequency_min Equilibration_min   Device_type
## 1:                         2                43 STD 4-channel
## 2:                         2                43 STD 4-channel
## 3:                         2                43 STD 4-channel
## 4:                         2                43 STD 4-channel
## 5:                         2                43 STD 4-channel
## 6:                         2                43 STD 4-channel

We have to merge the single-cell data with the experiment description. The single-cell data stored in dtXGF contain only the field of view, in the column Metadata_Series. The experiment description in dtXGFexp contains the treatment information for every field of view in the Position column. The merge operation will assign the experiment description to every row of dtXGF by matching the field of view. As a result, dtXGF will be expanded with additional columns describing the treatment.
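The logic of this merge can be sketched on two tiny made-up tables (base R merge is used here; data.table's merge behaves the same for this case):

```r
# toy single-cell data: only the field of view identifies the condition
dtCells = data.frame(Metadata_Series = c(0, 0, 1),
                     Intensity_MeanIntensity_Ratio = c(1229, 1279, 1150))

# toy experiment description: one row per field of view
dtDescr = data.frame(Position  = c(0, 1),
                     Stim_Conc = c("250 ng/ml", "25 ng/ml"))

# match Metadata_Series against Position; every measurement row
# gains the description of its field of view
merged = merge(dtCells, dtDescr, by.x = "Metadata_Series", by.y = "Position")
merged
```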

Note: we will use only some columns from the experimental description.

# columns that we want from the experiment description files
vsExpDescrCols = c("Position", "Channel", "Stim_Conc", "Stim_Treat")

# merge data
dtEGF = merge(dtEGF, dtEGFexp[, ..vsExpDescrCols], by.x = c("Metadata_Series"), by.y = "Position")
dtFGF = merge(dtFGF, dtFGFexp[, ..vsExpDescrCols], by.x = c("Metadata_Series"), by.y = "Position")
dtNGF = merge(dtNGF, dtNGFexp[, ..vsExpDescrCols], by.x = c("Metadata_Series"), by.y = "Position")

head(dtEGF)
##    Metadata_Series Metadata_T TrackObjects_Label
## 1:               0          0                  3
## 2:               0          0                  4
## 3:               0          0                  8
## 4:               0          0                 12
## 5:               0          0                 13
## 6:               0          0                 15
##    Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
## 1:                      1229.246       1 250 ng/ml        EGF
## 2:                      1279.513       1 250 ng/ml        EGF
## 3:                      1236.920       1 250 ng/ml        EGF
## 4:                      1170.460       1 250 ng/ml        EGF
## 5:                      1150.294       1 250 ng/ml        EGF
## 6:                      1128.049       1 250 ng/ml        EGF

Now, with the 3 experiments loaded and merged with their descriptions, we can make a single data object. We will simply “put” one dataset after another in a row-bind operation using the rbindlist function.

# let's make a list of our data tables
lData = list(EGF = dtEGF, FGF = dtFGF, NGF = dtNGF)

# clean unnecessary objects
rm(dtEGF, dtFGF, dtNGF)
rm(dtEGFexp, dtFGFexp, dtNGFexp)

# let's bind it:
dtAll = rbindlist(lData)
## Error in rbindlist(lData): Item 2 has 9 columns, inconsistent with item 1 which has 7 columns. To fill missing columns use fill=TRUE.

Side note: since we created a named list lData with 3 data tables, we can now access list elements by name instead of by index, e.g.:

head(lData[["NGF"]])
##    Metadata_Series Metadata_T TrackObjects_Label
## 1:               0          0                  1
## 2:               0          0                  4
## 3:               0          0                  7
## 4:               0          0                 11
## 5:               0          0                 13
## 6:               0          0                 14
##    Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
## 1:                      1223.122       1 250 ng/ml        NGF
## 2:                      1179.491       1 250 ng/ml        NGF
## 3:                      1265.161       1 250 ng/ml        NGF
## 4:                      1213.916       1 250 ng/ml        NGF
## 5:                      1153.446       1 250 ng/ml        NGF
## 6:                      1136.897       1 250 ng/ml        NGF

The row-bind operation produced an error because one of the tables has a different number of columns from the rest. So, what gives?

# extract column names from every dt in a list
lColNames = lapply(lData, names)

lColNames
## $EGF
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"
## [5] "Channel"                       "Stim_Conc"                    
## [7] "Stim_Treat"                   
## 
## $FGF
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"
## [5] "Location_Center_X"             "Location_Center_Y"            
## [7] "Channel"                       "Stim_Conc"                    
## [9] "Stim_Treat"                   
## 
## $NGF
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"
## [5] "Channel"                       "Stim_Conc"                    
## [7] "Stim_Treat"

It seems that dtFGF contains two additional columns. Let’s select the column names that are common to all 3 data tables. We will use the intersect function, which finds the common elements of two vectors. Since we need the common elements of all 3 vectors of column names, we will use the Reduce function, which applies intersect to consecutive pairs from the lColNames list, i.e. 1-2, (the result of 1-2)-3, etc.

vsCommonColNames = Reduce(intersect, lColNames)
vsCommonColNames
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"
## [5] "Channel"                       "Stim_Conc"                    
## [7] "Stim_Treat"
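As a minimal illustration of the Reduce + intersect pattern, on made-up vectors:

```r
# toy column-name vectors; the second one has two extra entries
lNames = list(c("a", "b", "c"),
              c("a", "b", "x", "y", "c"),
              c("a", "b", "c"))

# fold intersect over the list: intersect(intersect(1st, 2nd), 3rd)
Reduce(intersect, lNames)
# "a" "b" "c"
```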

Let’s try row-binding again, this time with the option fill=TRUE, which fills missing columns with NA’s. Then we will keep only the common columns from each dt:

# row-bind
dtAll = rbindlist(lData, fill = T, idcol = "Stim_Treat")

# choose only relevant columns
dtAll = dtAll[, ..vsCommonColNames]

# remove unnecessary objects
rm(lData, lColNames, vsCommonColNames)

head(dtAll)
##    Metadata_Series Metadata_T TrackObjects_Label
## 1:               0          0                  3
## 2:               0          0                  4
## 3:               0          0                  8
## 4:               0          0                 12
## 5:               0          0                 13
## 6:               0          0                 15
##    Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
## 1:                      1229.246       1 250 ng/ml        EGF
## 2:                      1279.513       1 250 ng/ml        EGF
## 3:                      1236.920       1 250 ng/ml        EGF
## 4:                      1170.460       1 250 ng/ml        EGF
## 5:                      1150.294       1 250 ng/ml        EGF
## 6:                      1128.049       1 250 ng/ml        EGF

Read files in a loop (advanced)

This is a much more convenient way to batch-read a large number of files.

# create a list of files to read by searching a folder for files according to a pattern
vsInputFiles = list.files(path = file.path(lPar$dirRoot, lPar$dirData, "original"), 
                          pattern = "*.csv.gz", 
                          full.names = T)

# apply a file-reading function to the list of file names
lData = lapply(vsInputFiles, fread)

Each element of the list lData contains a data.table. There are 3 elements that correspond to the 3 files. It is handy to record which file the data came from; this will be useful later on to differentiate between conditions. One way to achieve that is to name the list elements. The name will be the growth factor name extracted from the file name using the gsub function and regular expressions.

# assign names to list elements; extract growth factor name from the file path
names(lData) = gsub(".*_(.*)sust_.*", "\\1", vsInputFiles)

names(lData)
## [1] "EGF" "FGF" "NGF"

Now, we are in a position to bind all list elements into a single data.table:

# combine data into a single object; the option idcol adds a column with the name of the list element
dtAll = rbindlist(lData, use.names = T, idcol = "Stim_Treat")
## Error in rbindlist(lData, use.names = T, idcol = "Stim_Treat"): Item 2 has 6 columns, inconsistent with item 1 which has 4 columns. To fill missing columns use fill=TRUE.

We have a problem because data tables do not have the same number of columns:

lapply(lData, names)
## $EGF
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"
## 
## $FGF
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"
## [5] "Location_Center_X"             "Location_Center_Y"            
## 
## $NGF
## [1] "Metadata_Series"               "Metadata_T"                   
## [3] "TrackObjects_Label"            "Intensity_MeanIntensity_Ratio"

The FGF data has two additional columns. We can pass fill = TRUE to rbindlist, which fills columns that do not exist with NA’s. Then we will keep only those columns that exist in all list elements.

# create a vector with common column names that exist in all list elements
vsCommonColNames = Reduce(intersect, lapply(lData, names))

# combine data into a single object; the option idcol adds a column with the name of the list element
dtAll = rbindlist(lData, use.names = T, fill = T, idcol = "Stim_Treat")

# add the new column name Stim_Treat
vsCommonColNames = c(vsCommonColNames, "Stim_Treat")

# retain only the common columns
dtAll = dtAll[, ..vsCommonColNames]

head(dtAll)
##    Metadata_Series Metadata_T TrackObjects_Label
## 1:               0          0                  3
## 2:               0          0                  4
## 3:               0          0                  8
## 4:               0          0                 12
## 5:               0          0                 13
## 6:               0          0                 15
##    Intensity_MeanIntensity_Ratio Stim_Treat
## 1:                      1229.246        EGF
## 2:                      1279.513        EGF
## 3:                      1236.920        EGF
## 4:                      1170.460        EGF
## 5:                      1150.294        EGF
## 6:                      1128.049        EGF

Finally, we need to add the experiment descriptions:

# create a list of files to read by searching a folder for files according to a pattern
vsExpFiles = list.files(path = file.path(lPar$dirRoot, lPar$dirData, "original"), 
                          pattern = "*.xlsx", 
                          full.names = T)

# apply a file-reading function to the list of file names
lExp = lapply(vsExpFiles, read_xlsx)

# bind list elements into a single `data.table` object
dtExp = as.data.table(rbindlist(lExp))

The merge of the single-cell data with the experiment description is done by treatment (Stim_Treat) and field of view (the Position / Metadata_Series columns). We will use only some columns from dtExp.

vsExpDescrCols = c("Position", "Channel", "Stim_Conc", "Stim_Treat")

dtAll = merge(dtAll, dtExp[, ..vsExpDescrCols], by.x = c("Stim_Treat", "Metadata_Series"), by.y = c("Stim_Treat", "Position"))

# remove unnecessary objects
rm(lData, lExp, dtExp, vsExpDescrCols, vsInputFiles, vsExpFiles, vsCommonColNames)

dtAll
##         Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
##      1:        EGF               0          0                  3
##      2:        EGF               0          0                  4
##      3:        EGF               0          0                  8
##      4:        EGF               0          0                 12
##      5:        EGF               0          0                 13
##     ---                                                         
## 120082:        NGF              15        100                 73
## 120083:        NGF              15        100                 71
## 120084:        NGF              15        100                 80
## 120085:        NGF              15        100                 78
## 120086:        NGF              15        100                 79
##         Intensity_MeanIntensity_Ratio Channel  Stim_Conc
##      1:                      1229.246       1  250 ng/ml
##      2:                      1279.513       1  250 ng/ml
##      3:                      1236.920       1  250 ng/ml
##      4:                      1170.460       1  250 ng/ml
##      5:                      1150.294       1  250 ng/ml
##     ---                                                 
## 120082:                      1232.113       4 0.25 ng/ml
## 120083:                      1255.342       4 0.25 ng/ml
## 120084:                      1250.922       4 0.25 ng/ml
## 120085:                      1300.154       4 0.25 ng/ml
## 120086:                      1185.412       4 0.25 ng/ml

Milestone 1

Let’s save the resulting data as a milestone for future reference:

sFilePathTmp = file.path(lPar$dirRoot, lPar$dirData, "m1_allGF_wExpDescr.csv")

fwrite(x = dtAll, file = sFilePathTmp, row.names = F)

# Compress the file to save space
gzip(sFilePathTmp, overwrite = T)

rm(sFilePathTmp)

If you are completely lost, here’s a file with the dataset resulting from the previous steps:

dtAll = fread(file.path(lPar$dirRoot, lPar$dirData, "m1_allGF_wExpDescr.csv.gz"))

Column names

So far, we have hard-coded column names in multiple points throughout the script, which is a very bad practice! Column names can change, which would then require multiple changes in the code. Let’s at least store the names of all columns in one place and make the code more robust.

# create a list with strings for column names

lCol = list(
  fov = "Metadata_Series",
  frame = "Metadata_T",
  trackid = "TrackObjects_Label",
  meas = "Intensity_MeanIntensity_Ratio",
  measNorm = "Intensity_MeanIntensity_Ratio_Norm",
  ch = "Channel",
  treatGF = "Stim_Treat",
  treatConc = "Stim_Conc",
  trackiduni = "TrackObjects_Label_Uni", # we will add this column later to our data
  clid = "Cluster_Label"
)

If required, column names can be changed only here and the remaining code can be left intact. Alternatively, column name definitions can be moved to command-line parameters or to an external file that is read during the script’s execution.

Some maintenance

Before we proceed, let’s make our lives easier and add a column with a unique track identifier. The current TrackObjects_Label column contains track ID’s that are unique only within a single field of view. To make them unique, we shall combine the Stim_Treat, Metadata_Series, and TrackObjects_Label columns into a string using the good old sprintf function.
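As a reminder of how the sprintf format string works (the format below mirrors the one used in the next chunk; the values are made up):

```r
# %s inserts a string; %02d and %04d zero-pad integers to 2 and 4 digits,
# so the resulting IDs sort correctly even as plain text
sprintf("%s_%02d_%04d", "EGF", 0, 3)
# "EGF_00_0003"
```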

dtAll[, (lCol$trackiduni) := sprintf("%s_%02d_%04d", get(lCol$treatGF), get(lCol$fov), get(lCol$trackid))]
head(dtAll)
##    Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
## 1:        EGF               0          0                  3
## 2:        EGF               0          0                  4
## 3:        EGF               0          0                  8
## 4:        EGF               0          0                 12
## 5:        EGF               0          0                 13
## 6:        EGF               0          0                 15
##    Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
## 1:                      1229.246       1 250 ng/ml            EGF_00_0003
## 2:                      1279.513       1 250 ng/ml            EGF_00_0004
## 3:                      1236.920       1 250 ng/ml            EGF_00_0008
## 4:                      1170.460       1 250 ng/ml            EGF_00_0012
## 5:                      1150.294       1 250 ng/ml            EGF_00_0013
## 6:                      1128.049       1 250 ng/ml            EGF_00_0015

A unique track ID will allow us to easily pick individual time series, plot them, etc.

Another step to make our lives easier is to key the data table. By setting a key to the unique track id, we will be able to easily subset our data as we shall see in a moment.

setkeyv(dtAll, lCol$trackiduni)

Finally we will remove columns that we don’t need anymore:

dtAll[, c(lCol$fov, lCol$trackid, lCol$ch) := NULL]

Data exploration

From now on, we will access the columns of dtAll using the variables defined above instead of typing column names explicitly.

For example, to calculate summary statistics of the measurement column, we write the following:

summary(dtAll[[lCol$meas]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   677.9  1196.4  1246.3  1257.3  1311.9  4060.9      18


and we have some missing data (18 NA’s). Let’s inspect the rows with NA’s.

Missing data

NA’s

head(dtAll[is.na(get(lCol$meas))])
##    Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio Stim_Conc
## 1:        FGF         11                            NA 2.5 ng/ml
## 2:        FGF         11                            NA 2.5 ng/ml
## 3:        FGF         11                            NA 2.5 ng/ml
## 4:        FGF         11                            NA 2.5 ng/ml
## 5:        FGF         11                            NA 2.5 ng/ml
## 6:        FGF         11                            NA 2.5 ng/ml
##    TrackObjects_Label_Uni
## 1:            FGF_08_0001
## 2:            FGF_08_0002
## 3:            FGF_08_0005
## 4:            FGF_08_0006
## 5:            FGF_08_0007
## 6:            FGF_08_0008

All NA’s come from a single time point in the same FOV; different tracks are affected. One way to deal with NA’s is to impute them with interpolated values. The imputeTS package has a handy na_interpolation function. But do we have enough data around the NA’s to interpolate? How long are the time series with NA’s?

Let’s print one such time series:

dtAll[get(lCol$trackiduni) == "FGF_08_0001"][[lCol$meas]]
##   [1] 1154.882 1153.657 1150.764 1147.915 1140.173 1142.111 1140.136
##   [8] 1140.837 1144.007 1135.256 1132.608       NA 1133.615 1134.720
##  [15] 1145.947 1130.349 1135.619 1138.593 1139.691 1137.407 1138.597
##  [22] 1130.640 1166.818 1184.005 1162.116 1173.394 1196.259 1202.515
##  [29] 1235.769 1236.971 1272.345 1266.500 1251.723 1228.215 1226.551
##  [36] 1218.169 1214.489 1229.251 1236.079 1231.706 1226.196 1227.705
##  [43] 1212.287 1195.712 1198.995 1202.976 1222.753 1215.568 1202.693
##  [50] 1217.286 1217.344 1218.442 1221.851 1226.046 1217.408 1192.199
##  [57] 1185.705 1181.744 1191.051 1200.445 1198.971 1203.541 1215.357
##  [64] 1215.048 1213.177 1226.709 1222.884 1236.786 1216.284 1220.544
##  [71] 1207.355 1219.985 1210.953 1221.475 1233.104 1236.179 1239.964
##  [78] 1238.919 1235.951 1219.092 1225.653 1225.001 1227.338 1217.194
##  [85] 1225.857 1242.707 1241.310 1241.725 1221.622 1219.596 1230.118
##  [92] 1236.985 1227.344 1233.488 1248.268 1242.010 1233.652 1228.994
##  [99] 1231.844 1243.105 1255.455

Let’s also check the length of all other time series with NA’s.

# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]]
vsTracksWithNAs
##  [1] "FGF_08_0001" "FGF_08_0002" "FGF_08_0005" "FGF_08_0006" "FGF_08_0007"
##  [6] "FGF_08_0008" "FGF_08_0010" "FGF_08_0011" "FGF_08_0014" "FGF_08_0017"
## [11] "FGF_08_0019" "FGF_08_0020" "FGF_08_0021" "FGF_08_0022" "FGF_08_0023"
## [16] "FGF_08_0024" "FGF_08_0025" "FGF_08_0027"
# calculate the number of time points of tracks with NA's
dtAll[vsTracksWithNAs, .N, by = c(lCol$trackiduni)]
##     TrackObjects_Label_Uni   N
##  1:            FGF_08_0001 101
##  2:            FGF_08_0002 101
##  3:            FGF_08_0005 101
##  4:            FGF_08_0006 101
##  5:            FGF_08_0007 101
##  6:            FGF_08_0008 101
##  7:            FGF_08_0010 101
##  8:            FGF_08_0011 100
##  9:            FGF_08_0014 101
## 10:            FGF_08_0017 101
## 11:            FGF_08_0019 101
## 12:            FGF_08_0020 101
## 13:            FGF_08_0021 101
## 14:            FGF_08_0022 101
## 15:            FGF_08_0023 101
## 16:            FGF_08_0024 101
## 17:            FGF_08_0025 101
## 18:            FGF_08_0027 101

Notice that we have subset the table using a vector of unique track id’s, vsTracksWithNAs, with the dtAll[vsTracksWithNAs, ...] syntax. This is possible because we keyed the table by the TrackObjects_Label_Uni column; a character vector provided to the i argument of the data table is automatically matched against the key.
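A minimal sketch of keyed subsetting on a made-up table:

```r
library(data.table)

dt = data.table(id = c("a", "a", "b", "c"), val = 1:4)
setkey(dt, id)

# with a key set, a character vector in i is matched against the key,
# so this returns all rows with id "a" or "c"
dt[c("a", "c")]
```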

Note 2: it looks like we have another problem. The track FGF_08_0011 has one time point fewer than the others. What’s going on? Is it a more common problem?

Missing time points

Let’s have a look at the min and max of the number of time points per time series. The .N symbol gives the number of rows, and this statistic is calculated per unique time series id.

summary(dtAll[, .N, by = c(lCol$trackiduni)][["N"]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   101.0   101.0   100.9   101.0   101.0

Some time points are missing entirely; there is not even an NA in the measurement column for them. We will add these rows with the measurement set to NA. Then we will interpolate.

To stretch the data, we will use a neat solution based on this. Just accept that it works.


# key the table with all non-measurement columns
vsGroupingCols = c(lCol$treatGF, lCol$treatConc, lCol$trackiduni)

setkeyv(dtAll, c(vsGroupingCols, lCol$frame))

# make a temporary table with a full sequence of time points for every grouping present
# equivalently, you could use data.table::CJ or base::expand.grid
dtTmp = dtAll[,
              .(min(get(lCol$frame), na.rm = T):max(get(lCol$frame), na.rm = T)),
              by = vsGroupingCols]

# key by grouping columns
setkeyv(dtTmp, c(vsGroupingCols, "V1"))

# subset original dt with the temporary dt; it's equivalent to left-join
dtAll = dtAll[dtTmp]

# clean
rm(dtTmp)
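The same padding idea can be sketched in base R with expand.grid and merge, on a toy series where one frame is missing entirely:

```r
# toy: frame 2 is absent as a row, not merely NA
df = data.frame(id = "a", frame = c(0, 1, 3), meas = c(10, 11, 13),
                stringsAsFactors = FALSE)

# full grid: every id crossed with the complete frame sequence
dfGrid = expand.grid(id = unique(df$id), frame = 0:3,
                     stringsAsFactors = FALSE)

# left join: the missing frame reappears as a row with meas = NA
dfFull = merge(dfGrid, df, by = c("id", "frame"), all.x = TRUE)
dfFull
```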

Data imputation

We have many more rows with measurements set to NA:

head(dtAll[is.na(get(lCol$meas))], 10)
##     Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio  Stim_Conc
##  1:        EGF         68                            NA 0.25 ng/ml
##  2:        EGF         20                            NA 0.25 ng/ml
##  3:        EGF         99                            NA 0.25 ng/ml
##  4:        EGF         99                            NA 0.25 ng/ml
##  5:        EGF         99                            NA 0.25 ng/ml
##  6:        EGF         99                            NA 0.25 ng/ml
##  7:        EGF         66                            NA 0.25 ng/ml
##  8:        EGF         99                            NA 0.25 ng/ml
##  9:        EGF         99                            NA 0.25 ng/ml
## 10:        EGF         99                            NA 0.25 ng/ml
##     TrackObjects_Label_Uni
##  1:            EGF_13_0020
##  2:            EGF_13_0023
##  3:            EGF_14_0002
##  4:            EGF_14_0010
##  5:            EGF_14_0015
##  6:            EGF_14_0017
##  7:            EGF_14_0018
##  8:            EGF_14_0032
##  9:            EGF_14_0035
## 10:            EGF_14_0049

We will use the na_interpolation function from the imputeTS package:

dtAll[, (lCol$meas) := na_interpolation(get(lCol$meas)), by = c(lCol$trackiduni)]

dtAll[is.na(get(lCol$meas))]
## Empty data.table (0 rows and 5 cols): Stim_Treat,Metadata_T,Intensity_MeanIntensity_Ratio,Stim_Conc,TrackObjects_Label_Uni

No more NA’s
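For interior gaps like these, linear interpolation of the kind na_interpolation performs can be mimicked with base R approx (a sketch of the idea, not the package’s actual implementation):

```r
# toy series with one interior NA
x   = c(1154.9, 1153.7, NA, 1147.9)
idx = which(is.na(x))
ok  = which(!is.na(x))

# interpolate the gap linearly from the surrounding points
x[idx] = approx(ok, x[ok], xout = idx)$y
x
# the NA becomes (1153.7 + 1147.9) / 2 = 1150.8
```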


Milestone 2

Let’s save the resulting data as a milestone for future reference:

sFilePathTmp = file.path(lPar$dirRoot, lPar$dirData, "m2_allGF_wExpDescr_noNAs.csv")

fwrite(x = dtAll, file = sFilePathTmp, row.names = F)

# Compress the file to save space
gzip(sFilePathTmp, overwrite = T)

rm(sFilePathTmp, vsGroupingCols, vsTracksWithNAs)

If you are completely lost, here’s a file with the dataset resulting from the previous steps:

dtAll = fread(file.path(lPar$dirRoot, lPar$dirData, "m2_allGF_wExpDescr_noNAs.csv.gz"))

Plotting

Finally, we can plot some data!

Single-cell data

p0 = ggplot(dtAll, aes_string(x = lCol$frame, y = lCol$meas, group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) # parameter alpha adds transparency

p0


and we have outliers. Actually, two kinds of outliers: some time series have single time points that are way above the average, and several time series behave entirely differently from the rest, so-called dynamic outliers.

Point outliers

Let’s remove single time points above a threshold of 2000 and impute them with interpolated data.

dtAll[get(lCol$meas) > 2000]
##    Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio Stim_Conc
## 1:        EGF         22                      4060.864 250 ng/ml
## 2:        FGF         65                      2085.110  25 ng/ml
##    TrackObjects_Label_Uni
## 1:            EGF_00_0022
## 2:            FGF_05_0018
# replace time points with measurements above the threshold with NA's
dtAll[get(lCol$meas) > 2000, (lCol$meas) := NA]

# interpolate NA's (again)
dtAll[, (lCol$meas) := na_interpolation(get(lCol$meas)), by = c(lCol$trackiduni)]

Dynamic outliers

Let’s remove entire trajectories if the measurement is below 1000:

# vector with unique track id's of dynamic outliers
vsTrackOut = unique(dtAll[get(lCol$meas) < 1000][[lCol$trackiduni]])

# leave only tracks that are not in the outlier vector
dtAll = dtAll[!(get(lCol$trackiduni) %in% vsTrackOut)]

# clean
rm(vsTrackOut)

Milestone 3

Let’s save the resulting data as a milestone for future reference:

sFilePathTmp = file.path(lPar$dirRoot, lPar$dirData, "m3_allGF_wExpDescr_noNAs_noOut.csv")

fwrite(x = dtAll, file = sFilePathTmp, row.names = F)

# Compress the file to save space
gzip(sFilePathTmp, overwrite = T)

rm(sFilePathTmp)

If you are completely lost, here’s a file with the dataset resulting from the previous steps:

dtAll = fread(file.path(lPar$dirRoot, lPar$dirData, "m3_allGF_wExpDescr_noNAs_noOut.csv.gz"))

Plotting (continued)

Plot per condition

# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame, y = lCol$meas, group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) # parameter alpha adds transparency

# The facet_grid accepts a formula with column names, e.g.
# facet_grid(Stim_Treat ~ Stim_Conc)
# An equivalent solution to use variables with string column names is:
# facet_grid(as.formula(paste0(lCol$treatGF, "~", lCol$treatConc)))
# Or, to use reformulate function that creates a formula from strings
p1 = p0 +
  facet_grid(reformulate(lCol$treatGF, lCol$treatConc))


p1
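To see what reformulate actually builds, here is a small standalone check (using this dataset’s Stim_Treat and Stim_Conc column names):

```r
# reformulate(termlabels, response) creates a formula from strings:
# the first argument lands on the right-hand side, the second on the left
f = reformulate("Stim_Treat", "Stim_Conc")
f
## Stim_Conc ~ Stim_Treat
```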

Normalisation

The baseline level before stimulation is very heterogeneous (due to biological and technical noise). At this stage, we do not care about that variability; we only want to cluster the responses. Thus, we might get better results by normalising every time series to its baseline (i.e., the mean over the first 20 time points).

# add a column with the mean of the baseline for every time series
dtAll[, 
      baseline := mean(.SD[1:20, get(lCol$meas)]), # add a new column with the mean of first 20 rows of the group
      by = c(lCol$trackiduni)] # group by unique trajectory

# add a column with normalized measurement
dtAll[,
      (lCol$measNorm) := get(lCol$meas) / baseline]

# remove baseline column
dtAll[, baseline := NULL]

Plot per condition using normalized data

# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame, y = lCol$measNorm, group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) # parameter alpha adds transparency

p1 = p0 +
  facet_grid(reformulate(lCol$treatGF, lCol$treatConc))

p1

Milestone 4

Let’s save the resulting data as a milestone for future reference:

sFilePathTmp = file.path(lPar$dirRoot, lPar$dirData, "m4_allGF_wExpDescr_noNAs_noOut_norm0-20.csv")

fwrite(x = dtAll, file = sFilePathTmp, row.names = F)

# Compress the file to save space
gzip(sFilePathTmp, overwrite = T)

rm(sFilePathTmp)

If you are completely lost, here’s a file with the dataset resulting from the previous steps:

dtAll = fread(file.path(lPar$dirRoot, lPar$dirData, "m4_allGF_wExpDescr_noNAs_noOut_norm0-20.csv.gz"))

Plotting (continued 2)

Adding stats

The mean

p2 = p1 +
  stat_summary(
      aes_string(y = lCol$measNorm, group = 1),
      fun.y = mean,
      colour = 'red',
      linetype = 'solid',
      size = 1,
      geom = "line",
      group = 1
    )

p2

Confidence intervals

p3 = p2 +
    stat_summary(
      aes_string(y = lCol$measNorm, group = 1),
      fun.data = mean_cl_normal,
      colour = 'red',
      alpha = 0.25,
      geom = "ribbon",
      group = 1, 
      linetype = "dashed"
    )
p3

Interactive plots

For the sake of clarity, we choose only one condition for plotting:

pInt = ggplot(dtAll[get(lCol$treatGF) == "NGF" & get(lCol$treatConc) == "0.25 ng/ml"], 
              aes_string(x = lCol$frame, 
                         y = lCol$measNorm, 
                         group = lCol$trackiduni)) +
  geom_path(alpha = 0.1)

ggplotly(pInt)

Beautifying plots

Add themes, e.g. + theme_bw(), or themes available in packages such as ggthemes.

Use + labs() to add the title, the subtitle, and the caption.

Use + xlab() or + ylab() to control labels of x and y axes.

p3 +
  theme_minimal() +
  labs(title = "ERK activity in response to sustained GF treatments",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("Time (min)") +
  ylab("ERK activity")

Plotting amplitude at selected time points

# define a vector with selected time points
vnTpts = c(10, 25, 50)

# plot
ggplot(dtAll[get(lCol$frame) %in% vnTpts], 
       aes_string(x = paste0("as.factor(", lCol$frame, ")"), y = lCol$measNorm)) +
  # geom_dotplot(binaxis = "y", 
  #              stackdir = "center", 
  #              position = "dodge",
  #              alpha = 0.1, 
  #              binwidth = .008, 
  #              binpositions = "all") +
  geom_violin(aes_string(fill = paste0("as.factor(", lCol$frame, ")"))) +
  geom_boxplot(fill = NA,
               outlier.shape = NA) + 
  facet_grid(reformulate(lCol$treatGF, lCol$treatConc)) +
  scale_fill_discrete("Time (min):") +
  theme_minimal() +
  labs(title = "Distribution of ERK activity at selected time points",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("Time (min)") +
  ylab("ERK activity")

Hierarchical clustering

The package gplots offers a handy function, heatmap.2, to plot a heat-map and to cluster the data. To use it, we need to provide our single-cell dataset as a matrix in the wide format: each row contains an individual time series, and the columns correspond to time points.

Convert from long to wide format

# long to wide format
# every row corresponds to a single time series; column are time points
dtAllWide = dcast(dtAll, TrackObjects_Label_Uni ~ Metadata_T, value.var = lCol$measNorm)

# obtain row names for later
vsRowNames = dtAllWide[[lCol$trackiduni]]

# make a matrix
mAllWide = as.matrix(dtAllWide[, -1])

# assign row names to the matrix (useful for later plotting heatmaps from clustering)
rownames(mAllWide) = vsRowNames

Plotting the heat-map

heatmap.2(mAllWide)

We need to tune the default parameters. First, we want to cluster by rows only; clustering by columns makes no sense here, because the columns are time points whose order we need to maintain in the plot.

Tune:

  • the dendrogram for columns is removed with the option dendrogram = "row", and clustering by columns is disabled with Colv = F,
  • display the density of measured values on the colour key with density.info = "density",
  • apply a custom palette, col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
  • add labels with xlab, ylab, and key.xlab.
heatmap.2(mAllWide, 
          dendrogram = "row",
          Colv = F,
          trace = "none",
          density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)",
          ylab = "Cells", 
          key.xlab = "ERK activity"
          )

Piping

Before we proceed, let’s introduce a very handy concept in R called piping. The pipe operator, %>%, lets us express a sequence of operations without creating intermediate variables. It comes with the magrittr package and is loaded automatically by the tidyverse packages (and by several packages used here).
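A minimal illustration of the equivalence between nested calls and a pipe:

```r
library(magrittr)

# nested calls read inside-out...
round(mean(sqrt(c(1, 4, 9, 16))), 1)
## [1] 2.5

# ...whereas the pipe reads left-to-right
c(1, 4, 9, 16) %>% sqrt() %>% mean() %>% round(1)
## [1] 2.5
```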

Colour dendrogram branches

# create a coloured dendrogram using a given distance and a linkage method
myRowDend  <- mAllWide %>% 
  proxy::dist(., "euclidean") %>% # calculate distance
  stats::hclust(., "complete") %>% # create a dendrogram
  stats::as.dendrogram(.) %>%
  dendextend::set("branches_k_color", k = 4) %>% # color k main branches
  dendextend::set("branches_lwd", 2) %>% # set line width of the dendrogram
  dendextend::ladderize(.) # reorganize the dendrogram

heatmap.2(mAllWide, 
          dendrogram = "row",
          Rowv = myRowDend, # use our coloured dendrogram to order rows
          Colv = F,
          trace = "none",
          density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)",
          ylab = "Cells", 
          key.xlab = "ERK activity"
          )

Shape clustering

The approaches above treat individual time points as independent measurements. Instead, shape-based methods such as Dynamic Time Warping (DTW) are suited to calculating distances between entire time series.
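A toy comparison on two mutually shifted series shows why DTW helps; dtw_basic and the symmetric1 step pattern come with the dtwclust package loaded above:

```r
library(dtwclust)

# the second series is the first one shifted by one time point
s1 = c(0, 1, 2, 3, 2, 1, 0)
s2 = c(0, 0, 1, 2, 3, 2, 1)

# point-wise (Euclidean) distance penalises the shift at every time point
dEuc = sqrt(sum((s1 - s2)^2))

# DTW first aligns the two series, then measures the remaining difference
dDtw = dtw_basic(s1, s2, norm = "L1", step.pattern = symmetric1)

dDtw < dEuc
## [1] TRUE
```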

Some resources:

# From: https://stackoverflow.com/a/50776685/1898713

myRowDend  <- mAllWide %>% 
  proxy::dist(., method = "dtw_basic", pattern = symmetric1, norm = "L1", window.size = 10L) %>% 
  stats::as.dist(.) %>% # conversion required because dtw_basic returns a cross-distance matrix; it is symmetric, thus we take the lower triangle
  stats::hclust(., "complete") %>% 
  stats::as.dendrogram(.) %>%
  dendextend::set("branches_k_color", k = 5) %>% 
  dendextend::set("branches_lwd", 2) %>%
  dendextend::ladderize(.)

heatmap.2(mAllWide, 
          dendrogram = "row",
          Rowv = myRowDend,
          Colv = F,
          trace = "none",
          density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)",
          ylab = "Cells", 
          key.xlab = "ERK activity"
          )

Interactive heat-maps

d3heatmap(mAllWide,
          dendrogram = "row",
          Rowv = myRowDend,
          Colv = F,
          trace = "none",
          show_grid = F,
          density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)",
          ylab = "Cells", 
          key.xlab = "ERK activity")

Distance and linkage methods

We will use the standard R function hclust to perform hierarchical clustering. Its first and only required argument is a matrix with all pairwise distances between the time series. We obtain such a matrix with the dist function, which requires data in the wide format.

Distance matrix

mDist = dist(mAllWide, method = "euclidean")

hclust

clRes = hclust(mDist, method = "complete")

Extract clusters

After performing hierarchical clustering and cutting the dendrogram at the desired level, we extract the assignment of cluster numbers to individual time series.

Step 1: hierarchical clustering

clTree  <- mAllWide %>% 
  proxy::dist(., method = "dtw_basic", pattern = symmetric1, norm = "L1", window.size = 10L) %>% 
  stats::as.dist(.) %>% # conversion required because dtw_basic returns a cross-distance matrix; it is symmetric, thus we take the lower triangle
  stats::hclust(., "complete") 

str(clRes)
## List of 7
##  $ merge      : int [1:1186, 1:2] -75 -20 -93 -92 -87 -499 -98 -102 -52 -516 ...
##  $ height     : num [1:1186] 0.0858 0.0867 0.0917 0.0973 0.1016 ...
##  $ order      : int [1:1187] 968 117 392 677 187 729 444 993 991 459 ...
##  $ labels     : chr [1:1187] "EGF_00_0003" "EGF_00_0004" "EGF_00_0008" "EGF_00_0012" ...
##  $ method     : chr "complete"
##  $ call       : language hclust(d = mDist, method = "complete")
##  $ dist.method: chr "Euclidean"
##  - attr(*, "class")= chr "hclust"

Step 2: cut the dendrogram

clAssign = dendextend::cutree(clTree, k = 6)
head(clAssign)
## EGF_00_0003 EGF_00_0004 EGF_00_0008 EGF_00_0012 EGF_00_0013 EGF_00_0015 
##           1           2           2           3           1           1

Convert the named vector to a data table.

dtClAssign = as.data.table(clAssign, keep.rownames = T)
setnames(dtClAssign, c(lCol$trackiduni, lCol$clid))
head(dtClAssign)
##    TrackObjects_Label_Uni Cluster_Label
## 1:            EGF_00_0003             1
## 2:            EGF_00_0004             2
## 3:            EGF_00_0008             2
## 4:            EGF_00_0012             3
## 5:            EGF_00_0013             1
## 6:            EGF_00_0015             1

Step 3: Merge original time series with cluster assignments for individual time series.

dtAllCl = merge(dtAll, dtClAssign, by = lCol$trackiduni)

# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]

Milestone 5

Let’s save the resulting data as a milestone for future reference:

sFilePathTmp = file.path(lPar$dirRoot, lPar$dirData, "m5_allGF_wExpDescr_noNAs_noOut_norm0-20_cl.csv")

fwrite(x = dtAllCl, file = sFilePathTmp, row.names = F)

# Compress the file to save space
gzip(sFilePathTmp, overwrite = T)

rm(sFilePathTmp)

If you are completely lost, here’s a file with the dataset resulting from the previous steps:

dtAllCl = fread(file.path(lPar$dirRoot, lPar$dirData, "m5_allGF_wExpDescr_noNAs_noOut_norm0-20_cl.csv.gz"))

Plot clusters

Time series per cluster

p4 = ggplot(dtAllCl, aes_string(x = lCol$frame, 
                                y = lCol$measNorm, 
                                group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) +
  stat_summary(
      aes_string(y = lCol$measNorm, group = 1),
      fun.y = mean,
      colour = 'red',
      linetype = 'solid',
      size = 1,
      geom = "line",
      group = 1) +
  facet_wrap(lCol$clid) +
  theme_minimal() +
  labs(title = "Clusters of ERK activity dynamic responses sustained GF treatments",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("Time (min)") +
  ylab("ERK activity (normalised)")

p4

Distribution per condition

Now we would like to see how the clusters are distributed across experimental conditions. To achieve that, we perform a simple data aggregation where we count the number of time series per condition, per cluster:
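As a reminder, in data.table the special symbol .N holds the number of rows in each group; a toy illustration:

```r
library(data.table)

dtToy = data.table(grp = c("a", "a", "a", "b"), val = 1:4)
dtToy[, .(n = .N), by = grp]
##    grp n
## 1:   a 3
## 2:   b 1
```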

dtAllClN = dtAllCl[, .(ntc = .N), by = c(lCol$treatGF, lCol$treatConc, lCol$clid)]

Let’s create this plot step by step and use a colour-blind-friendly palette:

# The core plot: concentrations on the x-axis, the number of time series on y-axis
p5 = ggplot(dtAllClN, aes_string(x = lCol$treatConc, y = "ntc"))

# Facetting per growth factor
p5 = p5 +
  facet_wrap(lCol$treatGF)
  
# Stacked bar plot with bars coloured by cluster number
# Bars are stretched to "fill" the y-axis using the option position = position_fill()
p5 = p5 +
  geom_bar(aes_string(fill = lCol$clid), 
           stat = "identity", 
           position = position_fill())

# Convert y-axis labels to percentages (0-100%) instead of (0-1)
p5 = p5 +
  scale_y_continuous(labels = percent)

# Use a nice colour palette for colour fill
p5 = p5 +
  scale_fill_manual("GF:", values = ggthemes::tableau_color_pal("Color Blind")(6))
  
# Prettify the plot; add labels, etc
p5 = p5 +
  theme_minimal() +
  labs(title = "Participation of clusters in experimental conditions",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("") +
  ylab("Percentage") +
  theme(axis.text.x =  element_text(angle = 45, hjust = 1))

p5

Clustering validation

Based on Clustering Validation Statistics.

Optimal number of clusters - Silhouette method

# Silhouette method
factoextra::fviz_nbclust(mAllWide, hcut, method = "silhouette")+
  labs(subtitle = "Silhouette method")

Optimal number of clusters - Elbow method

# Elbow method
factoextra::fviz_nbclust(mAllWide, hcut, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method")

Optimal number of clusters - Gap statistic method

# Gap statistic
# nboot = 10 to keep the function speedy; the recommended value for a real analysis is nboot = 500.
# Use verbose = FALSE to hide the computing progression.
set.seed(123)
factoextra::fviz_nbclust(mAllWide, hcut, nstart = 25, method = "gap_stat", nboot = 10) +
  labs(subtitle = "Gap statistic method")

30 indices for choosing the best number of clusters

nb <- NbClust::NbClust(mAllWide, 
              distance = "euclidean", 
              min.nc = 2, max.nc = 10, 
              method = "complete", 
              index ="all")
## Warning in log(det(P)/det(W)): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
# Visualize the result
factoextra::fviz_nbclust(nb) + theme_minimal()
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 5 proposed  2 as the best number of clusters
## * 9 proposed  3 as the best number of clusters
## * 4 proposed  4 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## * 3 proposed  NA's as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

Clustering analysis

hc.res <- factoextra::eclust(mAllWide, "hclust", k = 3, 
                 hc_method = "complete", hc_metric = "euclidean",
                 graph = FALSE)
# Visualize clusters
factoextra::fviz_cluster(hc.res, geom = "point", ellipse.type = "norm")

factoextra::fviz_silhouette(hc.res)
##   cluster size ave.sil.width
## 1       1  893          0.21
## 2       2  274          0.48
## 3       3   20          0.48

Silhouette analysis

sil <- cluster::silhouette(hc.res$cluster, dist(mAllWide))
head(sil[, 1:3], 10)
##       cluster neighbor  sil_width
##  [1,]       1        3 0.39088711
##  [2,]       1        3 0.40465167
##  [3,]       1        3 0.25854173
##  [4,]       1        2 0.18178423
##  [5,]       1        3 0.42406768
##  [6,]       1        3 0.42126024
##  [7,]       2        1 0.31029735
##  [8,]       1        3 0.07900846
##  [9,]       1        3 0.40394547
## [10,]       1        3 0.21347850
# Silhouette plot
factoextra::fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  893          0.21
## 2       2  274          0.48
## 3       3   20          0.48
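The silhouette width of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the nearest other cluster. A toy check of this definition with the cluster package loaded above:

```r
library(cluster)

# two clearly separated 1-D "clusters"
mToy = matrix(c(1, 1.2, 10, 10.3), ncol = 1)
vCl = c(1, 1, 2, 2)

silToy = cluster::silhouette(vCl, dist(mToy))

# manual computation for the first point
a1 = 0.2              # distance to the other member of cluster 1
b1 = mean(c(9, 9.3))  # mean distance to the members of cluster 2

all.equal(silToy[1, "sil_width"], (b1 - a1) / max(a1, b1))
## [1] TRUE
```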